Skip to content

revision: use priority queue for streaming walks#2127

Open
spkrka wants to merge 3 commits into
gitgitgadget:masterfrom
spkrka:streaming-prio-queue
Open

revision: use priority queue for streaming walks#2127
spkrka wants to merge 3 commits into
gitgitgadget:masterfrom
spkrka:streaming-prio-queue

Conversation

@spkrka
Copy link
Copy Markdown

@spkrka spkrka commented May 27, 2026

This is a follow-up to kk/limit-list-optim (now in master), which
replaced the O(N) sorted linked-list insertion in limit_list() with
a priority queue. In the review thread for that patch, I
mentioned that the same approach could be applied to the streaming
(non-limited) walk in get_revision_1(). Junio suggested doing it
as a separate topic, and Peff noted he had a local branch
(jk/revs-commits-prio-queue) doing a broader conversion of
revs->commits to prio_queue entirely.

This series takes a lighter-weight approach: it keeps the linked
list for setup and external callers, and adds a separate
commit_queue field that the streaming walk drains into on first use.
This avoids touching bisect, line-log, and list-objects code, at
the cost of a "only one should be non-empty" invariant between the
two fields.

Together with the limit_list() change already in master, this
eliminates all commit_list_insert_by_date() callers from
revision.c.

Patch 1 is a small leak fix -- a missing release_revisions() call
in pack-objects that becomes visible once the commit queue uses a
dynamically allocated prio_queue array.

Patch 2 introduces a rev_walk_mode enum to replace the repeated
if/else chains in get_revision_1(). The function dispatches on
walk mode in multiple places (next commit, expand parents, flag
clearing) and these chains must stay in sync. The enum resolves
the mode once and both dispatch sites switch on the same variable.
This is a lighter alternative to the vtable-based refactoring I
mentioned before. No functional change.

Patch 3 is the actual conversion of the streaming walk to use a
priority queue.

== Why this helps ==

The streaming walk in get_revision_1() inserts newly discovered
parent commits into a date-sorted queue. On master, this uses
commit_list_insert_by_date(), which walks the linked list to find
the insertion point -- O(w) per insert, where w is the queue width
(active walk frontier).

In merge-heavy repositories, the walk frontier stays wide:

Repository Commits Peak width Avg width

monorepo (2.4M) 2,420K 2,653 1,700
linux.git 1,445K 581 235
git.git 82K 188 82

On the monorepo, each of the 2.4M commits requires scanning an
average of 1,700 list entries to find the insertion point. With
the priority queue, this drops to ~11 heap comparisons.

== Benchmarks ==

All benchmarks: best of 3 runs, same machine, commit-graph
present.

Streaming walks (affected by this series):

git rev-list --count HEAD (monorepo, 2.4M commits)

  master:   17.94s
  patched:   3.38s   (5.3x faster)

git rev-list HEAD (monorepo, full output)

  master:   27.72s
  patched:   8.61s   (2.8x faster, I/O-bound fraction unchanged)

Regression checks -- non-merge-heavy repos (streaming path,
but frontier stays narrow so O(w) insertion was never the
bottleneck):

git rev-list --count HEAD (linux.git, 1.4M commits)

  master:    1.76s
  patched:   1.81s   (no change)

git rev-list HEAD (linux.git, full output)

  master:    4.46s
  patched:   4.52s   (no change)

git rev-list --count HEAD (git.git, 82K commits)

  master:     83ms
  patched:    86ms   (no change)

Regression checks -- other walk modes (not affected by this
series):

git rev-list --count HEAD~5000...HEAD (monorepo, limited path)

  master:    7.36s
  patched:   7.02s   (no change)

== Profile breakdown ==

perf profiling of rev-list --count HEAD on the monorepo shows
where the time goes:

master (17.94s):
commit_list_insert_by_date 79% 14.25s
fixed overhead (parse/lookup) 21% 3.69s

patched (3.38s):
heap ops (compare + sift) 16% 0.53s
fixed overhead (parse/lookup) 84% 2.85s

The queue maintenance itself sped up 27x (14.25s to 0.53s).
The overall 5.3x is lower because the fixed costs -- object
lookup (17%), commit-graph parsing (14%), memory allocation
(10%) -- are roughly constant between the two versions at ~3s.

This means the patch removes the dominant bottleneck entirely.
After the patch, the walk cost is dominated by irreducible
per-commit work (parsing and object lookup) which scales
linearly with commit count regardless of frontier width.

spkrka added 3 commits May 27, 2026 17:11
enumerate_and_traverse_cruft_objects() initializes a rev_info on the
stack but never calls release_revisions() afterwards.  This is not
visible on master but becomes a leak once the revision walking
machinery uses dynamically allocated structures.

Add the missing release_revisions() call.

Signed-off-by: Kristofer Karlsson <krka@spotify.com>
get_revision_1() dispatches to different walk strategies based on a
combination of rev_info flags: reflog_info, topo_walk_info, and
limited.  These conditions are checked in multiple places within
the function -- once to select the next commit, and again to decide
how to expand parents -- and the two chains must stay in sync.

Extract the mode selection into a rev_walk_mode enum and a small
get_walk_mode() helper, resolved once at the top of get_revision_1().
Both dispatch sites now switch on the same mode variable, making it
obvious that they agree and easier to verify that all modes are
handled.

No functional change.

Signed-off-by: Kristofer Karlsson <krka@spotify.com>
The streaming (non-limited) walk in get_revision_1() inserts newly
discovered parent commits into a date-sorted queue via
commit_list_insert_by_date(), which scans the linked list to find the
insertion point -- O(w) per insert, where w is the width of the active
walk frontier.  Replace this with an O(log w) priority queue.

Add a commit_queue field to rev_info alongside the existing commits
linked list.  The two representations are mutually exclusive: setup
and external callers that need list access use the linked list, then
get_revision_1() lazily drains it into the priority queue on first
call.  Add a REV_WALK_NO_WALK enum value to distinguish the no_walk
case (which still uses the commit list) from the streaming case.

The conversion function rev_info_commit_list_to_queue() is public so
callers that know they will iterate can convert early.

Combined with the limit_list() priority queue change already in
master, this eliminates all O(w) sorted linked-list insertion from
the revision walk machinery.

Signed-off-by: Kristofer Karlsson <krka@spotify.com>
@spkrka spkrka force-pushed the streaming-prio-queue branch from 58004d9 to c0e7870 Compare May 27, 2026 15:19
@spkrka spkrka marked this pull request as ready for review May 27, 2026 15:48
@spkrka
Copy link
Copy Markdown
Author

spkrka commented May 27, 2026

/submit

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 27, 2026

Submitted as pull.2127.git.1779897003.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-2127/spkrka/streaming-prio-queue-v1

To fetch this version to local tag pr-2127/spkrka/streaming-prio-queue-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-2127/spkrka/streaming-prio-queue-v1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant